The aim of this course is to practice coding, data wrangling and programming, and to analyze and visualize open data.
Link to my github repository:
https://github.com/hhautakangas/IODS-project
## Heidi Hautakangas, 29.01.2017
## Week 2: Regression and model validation
# DATA WRANGLING
# 2. Read data
LEARN <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", as.is=T, header=T)
dim(LEARN)
str(LEARN)
attach(LEARN)
# Data contains 183 observations and 60 variables. 59 variables are integers and one variable is character.
# 3. Subsetting data gender, age, attitude, deep, stra, surf and points by combining questions in the learning2014 data
# check names
names(LEARN)
# create new variable Deep (back to original scale divide by 12)
d_sm <- D03+D11+D19+D27
d_ri <- D07+D14+D22+D30
d_ue <- D06+D15+D23+D31
LEARN$Deep <- (d_sm+d_ri+d_ue)/12
# create new variable Surf (back to original scale divide by 12)
su_lp <- SU02+SU10+SU18+SU26
su_um <- SU05+SU13+SU21+SU29
su_sb <- SU08+SU16+SU24+SU32
LEARN$Surf <- (su_lp+su_um+su_sb)/12
#create new variable Stra (back to original scale divide by 8)
st_os <- ST01+ST09+ST17+ST25
st_tm <- ST04+ST12+ST20+ST28
LEARN$Stra <- (st_os+st_tm)/8
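The same composite scores can be computed more compactly with rowMeans() over the relevant question columns. A minimal sketch on a made-up two-respondent data frame (the column names mirror the questionnaire items, but the values are invented):

```r
# toy data: two respondents answering four "deep learning" items on a 1-5 scale
toy <- data.frame(D03 = c(4, 2), D11 = c(5, 3), D19 = c(3, 1), D27 = c(4, 2))

# mean over the selected columns returns the score on the original 1-5 scale
toy$Deep <- rowMeans(toy[, c("D03", "D11", "D19", "D27")])

# equivalent to summing the items and dividing by their number
stopifnot(all(toy$Deep == (toy$D03 + toy$D11 + toy$D19 + toy$D27) / 4))
```

Here the first respondent gets (4+5+3+4)/4 = 4 and the second (2+3+1+2)/4 = 2.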
# create subset that contains following variables: gender, Age, Attitude, Points, Deep, Stra, Surf
# and exclude observations where points is zero
LEARNsub <- subset(LEARN, Points != 0, select=c(gender, Age, Attitude, Points, Deep, Stra, Surf))
# check subset dimension, should be 166 obs and 7 variables
dim(LEARNsub)
# 3. Set iods folder as working directory
setwd("~/Documents/IODS-project")
# write subset as a new file in the data folder
write.table(LEARNsub, "~/Documents/IODS-project/data/learning2014sub.txt", quote=FALSE, row.names = FALSE, col.names = TRUE)
# read data again
LEARNsub2 <- read.table("data/learning2014sub.txt", as.is=TRUE, header=TRUE)
# check that the structure is ok
str(LEARNsub2)
head(LEARNsub2)
The subset contains 166 observations and 7 variables: gender, Age, Attitude, Points, Deep, Stra and Surf. Gender is a character string coded F = female and M = male; Age, Attitude and Points are integers, and Deep, Stra and Surf are numeric. Age is reported in years derived from the date of birth; the mean age is about 25.5 years and the ages range from 17 to 55 years. Attitude measures the global attitude towards statistics, with a minimum of 14 and a maximum of 50 points in the data, a mean of 31.43 and a median of 32. The Points variable contains the total exam points; the mean is 22.72 and the points vary between 7 and 33. The variables Deep, Surf and Stra are composite scores that describe the learning approach: deep, surface or strategic. Their scale runs from 1 to 5.
# read data
LEARN <- read.table("data/learning2014sub.txt", as.is=TRUE, header=TRUE)
# attach the data so there is no need to refer to the dataset used
attach(LEARN)
# structure and dimension of the data
dim(LEARN)
## [1] 166 7
str(LEARN)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
# summary of the data
summary(LEARN)
## gender Age Attitude Points
## Length:166 Min. :17.00 Min. :14.00 Min. : 7.00
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Mode :character Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## Deep Stra Surf
## Min. :1.583 Min. :1.250 Min. :1.583
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417
## Median :3.667 Median :3.188 Median :2.833
## Mean :3.680 Mean :3.121 Mean :2.787
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :4.917 Max. :5.000 Max. :4.333
Plot 1 contains boxplots, distributions and scatter plots between all variables, classified by gender. Gender is not distributed evenly: there are almost twice as many female observations (n = 110) as male observations (n = 56). Most of the scatter plots show only negligible correlation between the variables. Exceptions are Attitude and Points, with a correlation of 0.437, and Deep and Surf, with a negative correlation of -0.324. Both the density and box plots show a difference between the answers of the male and female groups. This can also be observed in the correlations: for example, between Deep and Surf, males have a strong negative correlation (-0.622), whereas the correlation in females is negligible (-0.087). All distributions of the female observations have more than one peak, so they are presumably not normally distributed. For males, the distributions of Attitude and Points also clearly have several peaks. Age is positively skewed in both genders.
## plot with title added
p <- ggpairs(LEARN, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
LEARNp <- p + ggtitle("Plot 1. Learning 2014 data summary")
LEARNp
At first, to model factors that could explain success in the exam, I fitted a regression model with three explanatory variables, Age, gender and Attitude, Points being the dependent variable. The results are shown below (lm1). Of the explanatory variables, only Attitude (p = 8.34e-09) and the regression intercept (p = 2.48e-08) were statistically significant. A one-point gain in the global attitude towards statistics increases exam points by 0.36. Because there is no statistically significant relationship between Points and Age or between Points and gender, I removed them from the model and fitted a new one with only Attitude as an explanatory variable (lm2). The coefficient of Attitude remains almost the same: a one-point gain in attitude increases exam points by 0.35 (p = 4.12e-09). The Attitude vs. Points scatter plot is presented in Plot 2.
# linear regression
# dependent variable Points
# explanatory variables Age, gender and Attitude
lm1 <- lm(Points ~ Age + gender + Attitude, data=LEARN)
print(summary(lm1))
##
## Call:
## lm(formula = Points ~ Age + gender + Attitude, data = LEARN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## Age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## Attitude 0.36066 0.05932 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
# second model Attitude as an explanatory variable
lm2 <- lm(Points ~ Attitude, data=LEARN)
print(summary(lm2))
##
## Call:
## lm(formula = Points ~ Attitude, data = LEARN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# scatter plot between Attitude and Points
p2 <- qplot(Attitude, Points, data = LEARN) + geom_smooth(method = "lm")
p3 <- p2 + ggtitle("Plot 2. Student's attitude vs. exam points")
p3
The t-test assesses the effect of a given explanatory variable on the dependent variable by testing whether its coefficient differs from zero; the resulting t-value is used to decide whether this null hypothesis can be rejected. In regression analysis, the F-test tests whether the explanatory variables jointly explain the variation of the dependent variable. For both models the F-statistic was statistically significant: in the first model the p-value was 5.54e-08 and in the second 4.12e-09. R^2, the coefficient of determination, describes the proportion of the variation of the dependent variable that the explanatory variables of the regression model can explain. It varies between 0 and 1: a small R^2 indicates that the explanatory variables explain only a small part of the variation of the dependent variable, and a high R^2 the opposite. In the first model R^2 was 0.20, and in the second 0.19. Adjusted R^2 also takes the number of explanatory variables into account and is used to compare regression models with each other: it was 0.187 in the first model and 0.186 in the second model with only one explanatory variable.
The residual standard error describes the standard deviation of the residuals: the larger the deviation, and thus the residual standard error, the less precise the model's predictions are. Both models have a fairly large residual standard error, approximately 5.32.
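To make the definition of R^2 concrete, here is a small sketch (with made-up data) that computes the coefficient of determination by hand and checks it against lm():

```r
# small made-up dataset with a nearly linear relationship
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit <- lm(y ~ x)

# R^2 = 1 - SS_res / SS_tot
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

# matches the value reported by summary(fit)
all.equal(r2, summary(fit)$r.squared)
```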
Linear regression assumes that the relationship between the variables is linear. The errors are assumed to be normally distributed and uncorrelated. Homoskedasticity of the errors is also assumed, meaning that their variance is constant and does not depend on the observation. Heteroskedasticity can, however, be handled, for example by weighting observations by the inverse of their variance. Leverage measures how much impact a single observation has on the model, and a residuals-vs-leverage plot can be used to identify observations with unusually high impact.
For model validation, I examined these assumptions by producing three plots from the second model: residuals vs fitted values, a normal Q-Q plot, and residuals vs leverage. Plot 3b shows that the normality assumption of the errors holds. Plot 3a is used to examine homoskedasticity: the observations are scattered without any systematic pattern, so the homoskedasticity assumption is also valid. Plot 3c shows that the leverage is regular and no observation has unusually high impact. The model thus appears valid.
## Plots for validation
par(mfrow= c(2,2))
plot(lm2, which=c(1), main="Plot 3a. Model validation")
plot(lm2, which=c(2), main="Plot 3b. Model validation")
plot(lm2, which=c(5), main="Plot 3c. Model validation")
The ALC data includes 35 variables and 382 observations. There is some background information, such as sex, age, address, family size and the parents' jobs and education. The main interest is in study habits and grades and in the factors that might affect study success. Habits of alcohol use are also examined. There are three types of variables: 1. binary variables, for example sex, relationship status and extra-curricular activities; 2. numeric variables, for example age and the number of absences; and 3. nominal variables, for example the parents' jobs. Ages range from 15 to 22 and the median age is 17.
# names and summary
names(ALC)
dim(ALC)
str(ALC)
summary(ALC)
To study factors that might be related to high or low alcohol consumption, I chose the following four variables: sex, absences, romantic and studytime. Sex and romantic are binary variables, studytime and absences numeric. The hypotheses about their relationship with alcohol consumption are:
Sex:
H0: Sex does not affect alcohol consumption
H1: Alcohol consumption differs between males and females
Romantic:
H0: Being in a romantic relationship does not affect alcohol consumption
H1: Being in a romantic relationship lessens alcohol consumption
Absences:
H0: There is no relationship between alcohol consumption and the number of absences
H1: Those with higher alcohol consumption have more absences
Studytime:
H0: There is no relationship between study time and alcohol consumption
H1: Those who study more use less alcohol
To start, these relationships are examined with graphics and tables.
# tables and plots
# table 1
ALC %>% group_by(high_use, sex) %>% summarise(count = n())
## Source: local data frame [4 x 3]
## Groups: high_use [?]
##
## high_use sex count
## <lgl> <chr> <int>
## 1 FALSE F 156
## 2 FALSE M 112
## 3 TRUE F 42
## 4 TRUE M 72
## barplot SEX
plot(factor(high_use)~factor(sex), xlab="Sex", ylab="high use of alcohol", main="Plot 1. Sex of the students")
Table 1 shows that both sexes contain more students with low alcohol consumption, but the proportion of high users is larger among males (72/184) than among females (42/198). The same can be seen in Plot 1. These findings support the alternative hypothesis that alcohol consumption differs between males and females.
# table 2
ALC %>% group_by(high_use, romantic) %>% summarise(count = n())
## Source: local data frame [4 x 3]
## Groups: high_use [?]
##
## high_use romantic count
## <lgl> <chr> <int>
## 1 FALSE no 180
## 2 FALSE yes 88
## 3 TRUE no 81
## 4 TRUE yes 33
# barplot ROMANTIC
plot(factor(high_use)~factor(romantic),xlab="in a relationship", ylab="high use of alcohol", main= "Plot 2. Relationship and high alcohol use")
Plot 2 shows no difference in high alcohol consumption by relationship status. The same plot and the table also show that fewer students are in a romantic relationship than not (121 vs. 261). These findings support the null hypothesis.
# table 3
ALC %>% group_by(high_use) %>% summarise(count = n(), mean_absences=mean(absences))
## # A tibble: 2 × 3
## high_use count mean_absences
## <lgl> <int> <dbl>
## 1 FALSE 268 3.705224
## 2 TRUE 114 6.368421
# boxplot ABSENCES
g2 <- ggplot(ALC, aes(y =absences, x=high_use))
g2 + geom_boxplot() + ggtitle("Plot 3. Number of absences")
Table 3 shows that the mean number of absences differs between high and low users: those with high alcohol consumption have almost twice as many absences as those with low consumption. The same can be observed in Plot 3. These findings support the alternative hypothesis that high users have more absences.
#table 4
ALC %>% group_by(high_use) %>% summarise(count = n(), mean_weekly_study_time=mean(studytime))
## # A tibble: 2 × 3
## high_use count mean_weekly_study_time
## <lgl> <int> <dbl>
## 1 FALSE 268 2.149254
## 2 TRUE 114 1.771930
# barplot STUDYTIME
spineplot(factor(studytime)~factor(high_use), xlab="high use of alcohol", ylab="weekly study time", main="Plot 4. Weekly study time")
Table 4 shows that students with low alcohol consumption spend on average more hours studying per week than those with high consumption. The difference is also visible in Plot 4. This suggests that the null hypothesis could possibly be rejected. To confirm these findings, they need to be tested statistically.
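As a quick statistical check before the regression model, the sex hypothesis, for example, could be tested with a chi-squared test of independence built directly from the counts in Table 1 (a sketch; the matrix below just hard-codes those counts):

```r
# counts from Table 1: rows = sex (F, M), columns = alcohol use (low, high)
sex_use <- matrix(c(156, 112, 42, 72), nrow = 2,
                  dimnames = list(sex = c("F", "M"),
                                  high_use = c("FALSE", "TRUE")))

# a small p-value would support the alternative hypothesis
# that high alcohol use differs between the sexes
chisq.test(sex_use)
```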
Because the response variable high_use is binary, these relationships are best examined with logistic regression. The summary of the fitted model (below) shows that all variables except romantic are statistically significant at p < 0.05. With a more stringent threshold of p < 0.01, absences and sex remain significant, but studytime no longer reaches the required level. Studytime has a negative coefficient, while absences and sex have positive ones. Deviance is used to examine the goodness of fit of the model: when the model fits the data perfectly, the deviance is zero, and in a binomial model there is overdispersion if the deviance is much higher than the degrees of freedom. The null deviance, 465.68 on 381 degrees of freedom, suggests there may be some overdispersion in the model. The odds ratio for absences is 1.09 (95% CI 1.05-1.15), meaning that each additional absence multiplies the odds of high alcohol consumption by about 1.09. Males have 2.2 times higher odds of high alcohol consumption than females (95% CI 1.35-3.62). Weekly study time works in the opposite direction: its odds ratio of 0.67 (95% CI 0.49-0.91) means that each additional unit of study time lowers the odds of high consumption. These results support the earlier observations from the plots and tables.
# logistic regression model with all variables
m1 <- glm(high_use ~ absences + studytime + romantic + sex, data = ALC, family = "binomial")
# a summary of the model
summary(m1)
##
## Call:
## glm(formula = high_use ~ absences + studytime + romantic + sex,
## family = "binomial", data = ALC)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1456 -0.8356 -0.6145 1.0895 2.0865
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.87199 0.41840 -2.084 0.03715 *
## absences 0.09026 0.02314 3.900 9.61e-05 ***
## studytime -0.39477 0.15916 -2.480 0.01312 *
## romanticyes -0.20370 0.26174 -0.778 0.43643
## sexM 0.78878 0.25124 3.140 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 422.66 on 377 degrees of freedom
## AIC: 432.66
##
## Number of Fisher Scoring iterations: 4
# coefficients of the model
coef(m1)
## (Intercept) absences studytime romanticyes sexM
## -0.87198682 0.09025851 -0.39477449 -0.20369623 0.78877844
# compute odds ratios (OR)
OR <- coef(m1) %>% exp
# compute confidence intervals (CI)
CI <-confint(m1) %>% exp
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.4181200 0.1829448 0.9470310
## absences 1.0944572 1.0481820 1.1479059
## studytime 0.6738320 0.4886211 0.9135379
## romanticyes 0.8157101 0.4840025 1.3541213
## sexM 2.2007065 1.3501218 3.6224923
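The odds ratios in the table are simply the exponentiated model coefficients; for example, for absences (the coefficient value is copied from the model output above):

```r
# coefficient of absences from the fitted logistic model
b_absences <- 0.09025851

# the odds ratio is exp(coefficient): each additional absence
# multiplies the odds of high alcohol use by about 1.09
exp(b_absences)
```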
To compute the predictive power of the model, the statistically non-significant variable romantic is excluded from the model.
# model with only statistically significant variables
m2 <- glm(high_use ~ absences + studytime + sex, data = ALC, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ absences + studytime + sex, family = "binomial",
## data = ALC)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1771 -0.8487 -0.5971 1.0986 2.1157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.91782 0.41459 -2.214 0.02684 *
## absences 0.08876 0.02308 3.846 0.00012 ***
## studytime -0.40248 0.15930 -2.526 0.01152 *
## sexM 0.79826 0.25065 3.185 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.27 on 378 degrees of freedom
## AIC: 431.27
##
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use and add it in the data
probabilities <- predict(m2, type = "response")
ALC <- mutate(ALC, probability = probabilities)
# use the probabilities to make a prediction of high_use
ALC <- mutate(ALC, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = ALC$high_use, prediction = ALC$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 88 26
# tabulate the target variable versus the predictions
table(high_use = ALC$high_use, prediction = ALC$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.02617801 0.70157068
## TRUE 0.23036649 0.06806283 0.29842932
## Sum 0.90575916 0.09424084 1.00000000
## training error
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = ALC$high_use, prob = ALC$probability)
## [1] 0.2565445
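As a sanity check of the loss function, here is a tiny made-up example where the errors can be counted by hand:

```r
# same loss function as above: mean proportion of predictions
# that fall on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# four made-up observations: true classes and predicted probabilities
class <- c(1, 0, 1, 0)
prob  <- c(0.9, 0.2, 0.3, 0.6)

# the last two predictions are wrong, so the mean prediction error is 2/4 = 0.5
loss_func(class = class, prob = prob)
```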
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = ALC, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2591623
The probability that both the prediction and the observation are false is 0.68 (258/382), and the probability that both are true is 0.07 (26/382). The probability of a false positive is 0.026 (10/382) and of a false negative 0.23 (88/382). The probabilities are presented in Plot 5. The training error of the model is 0.26, and by 10-fold cross-validation the test error is approximately 0.26, which happens to be the same as in the DataCamp model.
# Plot 5. Predictions
g <- ggplot(ALC, aes(x = probability, y = high_use, col= prediction))
g + geom_point()+ggtitle("Plot 5. Prediction vs true values probabilites")
The Boston data set, on housing values in the suburbs of Boston, has 14 variables and 506 observations; 12 of the variables are continuous numeric variables and two are discrete integers. It includes several factors that might influence housing values, listed below.
-crim: per capita crime rate by town.
-zn: proportion of residential land zoned for lots over 25,000 sq.ft.
-indus: proportion of non-retail business acres per town.
-chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
-nox: nitrogen oxides concentration (parts per 10 million).
-rm: average number of rooms per dwelling.
-age: proportion of owner-occupied units built prior to 1940.
-dis: weighted mean of distances to five Boston employment centres.
-rad: index of accessibility to radial highways.
-tax: full-value property-tax rate per $10,000.
-ptratio: pupil-teacher ratio by town.
-black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
-lstat: lower status of the population (percent).
-medv: median value of owner-occupied homes in $1000s.
# load the data from MASS package
data("Boston")
## check how the data look like (structure and dimension etc.)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
To explore variables and their relationship further, I created plots about their distributions and correlations.
# Plots
p <- ggpairs(Boston, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
p1 <- p + ggtitle("Housing values in suburbs of Boston")
p1
Figure 1. Boston data variables
# correlation plot
cor_matrix<-cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d",tl.cex=0.6)
Figure 2. Correlations
The distribution plots in Figure 1 show that most variables do not follow a normal distribution. Only one variable, the average number of rooms per dwelling (rm), seems approximately normal. The other variables are either right-skewed (crim, zn, chas, nox, dis, lstat and medv), left-skewed (age, ptratio and black) or clearly bimodal (indus, rad and tax). There also seem to be strong correlations between many of the variables. The correlations are plotted in Figure 2, where red indicates negative and blue positive correlation, and the size and color intensity show the strength of the correlation: darker and bigger circles indicate stronger correlation than small and light ones. For example, the lower status of the population (lstat) and the median value of owner-occupied homes (medv) have a strong negative correlation, whereas the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax) have a strong positive correlation. Whether these observations are statistically significant can be confirmed with statistical tests.
## summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The skewness of the variables can also be observed in the summary statistics above. For crim, the interval between the minimum 0.00632 and the maximum 88.98 is wide, yet the median is only 0.257. The results are difficult to interpret because of the differing scales, so the data should be standardized (below). After standardization the variables are comparable with each other and easier to interpret.
# standardize data and print out the summary
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
#names(boston_scaled)
Linear discriminant analysis (LDA) is a classification method that finds the linear combinations of variables that best separate the classes; it can be used for predicting the classes of new data and for dimensionality reduction. The target variable can be either binary or multiclass. LDA assumes that the variables follow a normal distribution and that each class shares the same covariance matrix. To perform LDA with the categorical crime rate as the target variable, the data is divided into train and test sets.
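As a minimal illustration of LDA before applying it to the Boston data, the classic iris data can be classified with MASS::lda() (a sketch; for simplicity the model is fitted and evaluated on the full data, without a train/test split):

```r
library(MASS)

# fit LDA with the three-class Species variable as the target
lda_iris <- lda(Species ~ ., data = iris)

# predict classes on the same data and compute the training accuracy
pred <- predict(lda_iris)$class
mean(pred == iris$Species)
```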
# save the scaled crime rate
scaled_crim <- boston_scaled$crim
# summary of the scaled_crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
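The quantile-based binning above works as follows: quantile() gives the quartile break points and cut() assigns each value to a class, so the classes end up roughly equal in size. A small sketch on a made-up vector:

```r
# made-up vector of eight values
x <- 1:8

# quartile break points of x
bins <- quantile(x)

# cut into four categories; include.lowest keeps the minimum in the first bin
x_cat <- cut(x, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# each quartile class contains two of the eight values
table(x_cat)
```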
# Divide dataset to test and train sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# train set
train <- boston_scaled[ind,]
# and test set
test <- boston_scaled[-ind,]
# linear discriminant analysis on the train set
lda.fit <- lda(crime ~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2425743 0.2376238 0.2574257
##
## Group means:
## zn indus chas nox rm
## low 0.96370593 -0.9349023 -0.123759247 -0.8717615 0.4072208
## med_low -0.09193924 -0.2626959 0.008892378 -0.5831374 -0.1716203
## med_high -0.36039533 0.1430916 0.096774079 0.4040067 0.1352695
## high -0.48724019 1.0170690 -0.045188669 1.0583702 -0.3906955
## age dis rad tax ptratio
## low -0.8854340 0.9087221 -0.7001717 -0.7461622 -0.472305062
## med_low -0.3327573 0.3859220 -0.5447506 -0.4326087 -0.008683315
## med_high 0.4706783 -0.3797882 -0.4507052 -0.3401060 -0.342730070
## high 0.8349874 -0.8686611 1.6386213 1.5144083 0.781350740
## black lstat medv
## low 0.37396993 -0.74800557 0.49638526
## med_low 0.34418805 -0.09873939 -0.03296847
## med_high 0.08516851 0.05134030 0.20505842
## high -0.87125649 0.96002402 -0.74930762
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.124246792 0.520439555 -0.9600053
## indus 0.029572580 -0.261316088 0.3847656
## chas 0.006497286 0.052844262 0.1734815
## nox 0.341263088 -0.844055094 -1.2504420
## rm 0.059441582 -0.080698025 -0.1130264
## age 0.240964693 -0.465164087 -0.1639672
## dis -0.083658591 -0.277492263 0.1744373
## rad 3.626853316 0.868326264 -0.3147881
## tax 0.077611001 0.141768546 0.6985628
## ptratio 0.108432371 -0.004187035 -0.2008792
## black -0.085616392 0.050547673 0.1518625
## lstat 0.186191942 -0.223412863 0.3368574
## medv 0.015172975 -0.474465114 -0.1615891
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9576 0.0309 0.0115
# Plot
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes, main="LDA")
lda.arrows(lda.fit, myscale = 2)
Figure 3. LDA
Figure 3 presents the LDA results, where different colors represent different classes and the arrows show the impact of each predictor variable. Accessibility to radial highways seems to be strongly associated with high crime rates.
Classes can be predicted from the LDA model using the test data, and this information can be used to assess the performance of the model. The correct values are cross-tabulated against the predicted values.
# save test crime
correct_classes <- test$crime
## test set for prediction
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# and cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 6 0 0
## med_low 7 15 6 0
## med_high 0 10 17 3
## high 0 0 0 23
The model performs well for the high class: all 23 predictions are correct. The low class is predicted correctly 15 times and misclassified as medium-low 6 times. Medium-low is predicted correctly 15 times and incorrectly 13 times, and medium-high correctly 17 times and incorrectly 13 times, so the model performs weakest with the two middle classes.
K-means is a clustering method that groups observations based on their similarity. The number of clusters that best fits the data can be determined, for example, by examining how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. The optimal number can be seen in Figure 4 at the point where the WCSS drops radically, which here happens at two clusters.
data("Boston")
#scale
boston_scaled <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results to find optimal number of clusters
plot(1:k_max, twcss, type='b')
Figure 4. K-means clustering
# Plot
# k-means clustering
km <-kmeans(dist_eu, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Figure 5. Variables by the two clusters
In Figure 5, the variables are plotted pairwise and colored by cluster. The difference between the clusters can be seen in almost all panels; it is clear, for example, for the crime rate and the accessibility to radial highways, which was also seen in the LDA plot.